NBA Shot Predictor

Oliver Lee

1. Data Collection and Preprocessing

The goal of this project is to train a model to predict the likelihood a shot is made based on a variety of factors including shot location, shot type, player stats, and more. The main data used for training is found here: https://github.com/DomSamangy/NBA_Shots_04_25. This data contains every shot taken in the NBA from 2004-2025, with features such as player, shot type, shot location, etc.

We then merge this data with individual player statistics from the NBA API, as shown below.

In [2]:
import pandas as pd
from tqdm.notebook import tqdm
import time
from nba_api.stats.endpoints import PlayerDashboardByYearOverYear

Define function to fetch stats for a single player

This function uses the NBA API to get field goal %, 3-point %, and minutes played for a given player in the specified season.

In [3]:
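def get_player_stats(player_id, season='2024-25'):
    """Return FG%, 3P%, and minutes for one player-season, or None on failure."""
    try:
        dash = PlayerDashboardByYearOverYear(player_id=player_id, season=season)
        # Second frame is the year-by-year table; keep only the requested season
        df = dash.get_data_frames()[1]
        latest_season = df[df['GROUP_VALUE'] == season]
        if latest_season.empty:
            # No row for this season: report it as a failed lookup
            return None
        stats = latest_season[['FG_PCT', 'FG3_PCT', 'MIN']].copy()
        stats['PLAYER_ID'] = player_id
        return stats
    except Exception:
        # Any API or parsing error is treated as a failed lookup
        return None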
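
Because the API is rate limited, a single failed request does not necessarily mean the player has no stats. The sketch below shows one possible way to retry a request with exponential backoff. The helper name fetch_with_retry and its parameters are hypothetical and not part of nba_api; it wraps any callable that raises on failure.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `retries` times, doubling the wait after each failure.

    Returns the result of fetch(), or None if every attempt raised.
    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: network errors vary
            print(f"attempt {attempt + 1} failed: {exc!r}")
            if attempt < retries - 1:
                # Wait base_delay, then 2x, then 4x, ... before the next try
                sleep(base_delay * 2 ** attempt)
    return None
```

Note that get_player_stats as written returns None instead of raising, so using a wrapper like this would require a variant of the function that lets exceptions propagate.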

Load the raw shot data and fetch stats for unique player IDs

In [ ]:
original_df = pd.read_csv("./raw_data/NBA_2025_Shots.csv")
unique_ids = original_df['PLAYER_ID'].unique()
print(f"Loaded {len(original_df)} shot records for {len(unique_ids)} unique players.")

all_stats = []
failed_ids = []

for pid in tqdm(unique_ids, desc="Fetching Player Stats"):
    stats_df = get_player_stats(pid)
    if stats_df is not None:
        all_stats.append(stats_df)
    else:
        failed_ids.append(pid)
    time.sleep(0.5)  # Delay to avoid API rate limit

Merge the fetched stats with the original shot data

We'll combine all player stats, merge them with the original dataframe, then save the results.

In [ ]:
if all_stats:
    stats_combined = pd.concat(all_stats, ignore_index=True)
    merged_df = original_df.merge(stats_combined, on='PLAYER_ID', how='left')
    
    # Preview the merged data
    display(merged_df.head())
    
    # Save merged data to CSV
    merged_df.to_csv("./merged_data/24_25_allstats.csv", index=False)
    print("Saved merged stats to './merged_data/24_25_allstats.csv'.")
else:
    print("No player stats were retrieved.")

if failed_ids:
    print(f"Failed to fetch stats for {len(failed_ids)} players:")
    print(failed_ids)
else:
    print("Successfully fetched stats for all players.")
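
A left merge on PLAYER_ID keeps every shot exactly once only if the stats table has at most one row per player. The sketch below, on toy data, shows how duplicate keys would inflate the shot count and how the validate argument of DataFrame.merge can guard against it. The helper safe_merge and the choice to keep the first duplicate are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

shots = pd.DataFrame({"PLAYER_ID": [1, 1, 2], "SHOT_MADE": [True, False, True]})
# Hypothetical stats table in which player 1 appears twice
stats = pd.DataFrame({"PLAYER_ID": [1, 1, 2], "FG_PCT": [0.45, 0.47, 0.50]})


def safe_merge(shots, stats):
    """Left-merge stats onto shots, keeping one stats row per player."""
    # Keeping the first duplicate is an arbitrary choice made for this sketch
    deduped = stats.drop_duplicates(subset="PLAYER_ID", keep="first")
    # validate raises pandas.errors.MergeError if the right side has duplicate keys
    return shots.merge(deduped, on="PLAYER_ID", how="left", validate="many_to_one")


naive = shots.merge(stats, on="PLAYER_ID", how="left")
merged = safe_merge(shots, stats)
print(len(shots), len(naive), len(merged))
```

Comparing len(merged_df) with len(original_df) after the real merge is a cheap way to check whether this situation occurred.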

This merging process takes quite a while thanks to the API's rate limiting, but the final merged data will look like this (first 2 lines shown):

Column           Row 1                        Row 2
SEASON_1         2024                         2024
SEASON_2         2023-24                      2023-24
TEAM_ID          1610612764                   1610612764
TEAM_NAME        Washington Wizards           Washington Wizards
PLAYER_ID        1629673                      1630166
PLAYER_NAME      Jordan Poole                 Deni Avdija
POSITION_GROUP   G                            F
POSITION         SG                           SF
GAME_DATE        11-03-2023                   11-03-2023
GAME_ID          22300003                     22300003
HOME_TEAM        MIA                          MIA
AWAY_TEAM        WAS                          WAS
EVENT_TYPE       Missed Shot                  Made Shot
SHOT_MADE        False                        True
ACTION_TYPE      Driving Floating Jump Shot   Jump Shot
SHOT_TYPE        2PT Field Goal               3PT Field Goal
BASIC_ZONE       In The Paint (Non-RA)        Above the Break 3
ZONE_NAME        Center                       Center
ZONE_ABB         C                            C
ZONE_RANGE       8-16 ft.                     24+ ft.
LOC_X            -0.4                         1.5
LOC_Y            17.45                        30.55
SHOT_DISTANCE    12                           25
QUARTER          1                            1
MINS_LEFT        11                           10
SECS_LEFT        1                            26
FG_PCT           0.413                        0.506
FG3_PCT          0.326                        0.374
MIN              2345.555                     2256.6433333

2. Training a Random Forest Classifier

In [2]:
import pandas as pd
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Removing Unrelated Features

Some features should be removed before training, as they should have no impact on the shot outcome. We also drop PLAYER_ID here, but keep PLAYER_NAME as an easier way to identify each player. PLAYER_NAME is held out of the training features and kept only for looking up results. The target y is SHOT_MADE, the label the model predicts.

In [3]:
df = pd.read_csv('./raw_data/NBA_2025_Shots.csv')
df = df.drop(columns=['SEASON_2', 'GAME_ID', 'ZONE_ABB', 'EVENT_TYPE', 'GAME_DATE',
                     'PLAYER_ID', 'TEAM_ID', 'TEAM_NAME'])

X = df.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
y = df['SHOT_MADE'].astype(int)
X_encoded = pd.get_dummies(X)
X_encoded['PLAYER_NAME'] = df['PLAYER_NAME']

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded.drop(columns=['PLAYER_NAME']),  # Remove PLAYER_NAME for training
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Look up by index label (.loc), since X_test keeps the original row labels
test_player_ids = X_encoded.loc[X_test.index, 'PLAYER_NAME']

Finally, we train the random forest on X_train and y_train and store the model for analysis. For this project, I used a model trained specifically on the 24-25 season and tested it on data from previous seasons.

In [4]:
model = RandomForestClassifier(n_estimators=100, random_state=42, verbose=0)
model.fit(X_train, y_train)

joblib.dump({
    'model': model,
    'test_player_ids': test_player_ids,
    'feature_names': X_train.columns
}, './models/random_forest_24_25.joblib')
Out[4]:
['./models/random_forest_24_25.joblib']
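
The split above sets aside X_test and y_test, which makes a quick held-out accuracy check possible before moving on to other seasons. The sketch below illustrates the idea on synthetic data only: the single distance feature and the make-probability formula are invented for the example. It compares a random forest against a majority-class baseline from scikit-learn's DummyClassifier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for shot data: longer shots are made less often
distance = rng.uniform(0, 30, size=2000)
made = (rng.random(2000) < 0.7 - 0.015 * distance).astype(int)
X_demo = distance.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, made, test_size=0.2, stratify=made, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
# Baseline that always predicts the most common class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

forest_acc = accuracy_score(y_te, forest.predict(X_te))
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
print(f"forest accuracy:   {forest_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

With the real data, the same two accuracy_score calls on X_test and y_test would show how much the model improves on always predicting the more common outcome.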

3. Testing Model Performance

In [5]:
import pandas as pd
import numpy as np
import joblib

Load the stored model and merged data from a different season (in this case, 24-25 model on 23-24 data). Then drop the same columns from this data.

In [6]:
model_data = joblib.load('./models/random_forest_24_25.joblib')
model = model_data['model']
trained_features = model_data['feature_names']

df = pd.read_csv('./raw_data/NBA_2024_Shots.csv')
df = df.drop(columns=['SEASON_2', 'GAME_ID', 'ZONE_ABB', 'EVENT_TYPE', 'GAME_DATE',
                     'PLAYER_ID', 'TEAM_ID', 'TEAM_NAME'])

Now we can inspect the model's predictions and filter by any desired metric. For preliminary testing, I generated predictions for some individual players, displaying the accuracy as well as the shots that were incorrectly classified (limited to 5 examples here).

In [8]:
player_name = "Immanuel Quickley"

player_rows = df[df['PLAYER_NAME'] == player_name].copy()
player_y = player_rows['SHOT_MADE'].astype(int)

player_X = player_rows.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
player_X_encoded = pd.get_dummies(player_X)
player_X_encoded = player_X_encoded.reindex(columns=trained_features, fill_value=0)

predictions = model.predict(player_X_encoded)
probabilities = model.predict_proba(player_X_encoded)[:, 1]

player_rows = player_rows.assign(
    PREDICTED_MADE=predictions,
    PREDICTED_PROB=probabilities
)

correct = np.sum(player_y.values == predictions)
total = len(player_y)
print(f"Correct Predictions: {correct} / {total}")
print(f"Accuracy for {player_name}: {correct / total}")

# find examples that were incorrectly classified
mismatches = player_rows[player_rows['SHOT_MADE'] != player_rows['PREDICTED_MADE']]

print("\nMismatched Predictions (SHOT_MADE != PREDICTED_MADE):")
print(mismatches[[
    'PLAYER_NAME', 
    'ACTION_TYPE', 
    'SHOT_TYPE', 
    'SHOT_DISTANCE', 
    'ZONE_NAME',
    'SHOT_MADE', 
    'PREDICTED_MADE', 
    'PREDICTED_PROB'
]].head(5).to_string())
Correct Predictions: 537 / 894
Accuracy for Immanuel Quickley: 0.6006711409395973

Mismatched Predictions (SHOT_MADE != PREDICTED_MADE):
             PLAYER_NAME                      ACTION_TYPE       SHOT_TYPE  SHOT_DISTANCE         ZONE_NAME  SHOT_MADE  PREDICTED_MADE  PREDICTED_PROB
17577  Immanuel Quickley                        Jump Shot  3PT Field Goal             25  Left Side Center       True               0            0.24
17588  Immanuel Quickley                        Jump Shot  3PT Field Goal             27  Left Side Center       True               0            0.44
17611  Immanuel Quickley       Driving Floating Jump Shot  2PT Field Goal             11         Left Side       True               0            0.44
17626  Immanuel Quickley  Driving Floating Bank Jump Shot  2PT Field Goal             13        Right Side       True               0            0.42
17650  Immanuel Quickley       Driving Floating Jump Shot  2PT Field Goal              9            Center       True               0            0.41
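
The reindex call in the cell above matters because one-hot encoding a different season (or a single player) can produce a different set of columns than the model was trained on. The toy example below sketches what the alignment does: categories missing from the new data become all-zero columns, and categories the model never saw are dropped. The "Heave" shot type is a made-up value used only for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "SHOT_TYPE": ["2PT Field Goal", "3PT Field Goal"],
    "SHOT_DISTANCE": [5, 25],
})
# New data lacks one training category and contains one the model never saw
new = pd.DataFrame({
    "SHOT_TYPE": ["2PT Field Goal", "Heave"],
    "SHOT_DISTANCE": [8, 60],
})

train_cols = pd.get_dummies(train).columns
# Force the new data into the training column layout
aligned = pd.get_dummies(new).reindex(columns=train_cols, fill_value=0)
print(aligned)
```

One consequence is that a shot with an unseen category is encoded with zeros in every column of that feature, so the model's prediction for it rests on the remaining features.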

4. Generating Unique Visualizations and Metrics

Expected vs. Actual Points

One statistic I wanted to focus on is the notion of 'shot selection': the proportion of a player's shots that the model predicts to be a make or a miss. We can also look at how often a player makes a shot he is predicted to miss (a 'difficult' shot), or vice versa. In the cell below, each shot's expected points are its predicted make probability times its point value, and a player's points above expected are actual points minus expected points, summed over all of his shots.

In [43]:
X = df.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
X_encoded = pd.get_dummies(X)
X_encoded = X_encoded.reindex(columns=trained_features, fill_value=0)

df['PREDICTED_PROB'] = model.predict_proba(X_encoded)[:, 1]
df['PREDICTED_MADE'] = model.predict(X_encoded)

point_map = {
    '2PT Field Goal': 2,
    '3PT Field Goal': 3
}

df['SHOT_VALUE'] = df['SHOT_TYPE'].map(point_map)
df['ACTUAL_POINTS'] = df['SHOT_MADE'] * df['SHOT_VALUE']
df['EXPECTED_POINTS'] = df['PREDICTED_PROB'] * df['SHOT_VALUE']

player_summary = df.groupby('PLAYER_NAME').agg(
    total_shots=('SHOT_MADE', 'count'),
    actual_points=('ACTUAL_POINTS', 'sum'),
    expected_points=('EXPECTED_POINTS', 'sum'),
)

player_summary['points_above_expected'] = player_summary['actual_points'] - player_summary['expected_points']
top10 = (
    player_summary
    .sort_values(by='points_above_expected', ascending=False)
    .head(10)
)

print("Top 10 Players by Points Above Expected:")
print(top10.round(2).to_string())

import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'

player_summary = player_summary.reset_index()

fig = px.scatter(
    player_summary,
    x='expected_points',
    y='points_above_expected',
    hover_name='PLAYER_NAME',
    opacity=0.9,
    labels={
        'expected_points': 'Expected Points',
        'points_above_expected': 'Points Above Expected',
        'total_shots': 'Total Shots'
    },
    title='Player Expected Points vs. Points Above Expected',
    template='simple_white'
)

fig.add_shape(
    type="line",
    x0=player_summary['expected_points'].min(),
    x1=player_summary['expected_points'].max(),
    y0=0,
    y1=0,
    line=dict(color='gray', dash='dot', width=1)
)

fig.update_layout(
    template = "seaborn",
    font=dict(family="Helvetica", size=14, color='black'),
    title_font=dict(size=22, family="Helvetica", color='black'),
    height=700,
    width=950,
    margin=dict(l=60, r=30, t=70, b=60),
    xaxis=dict(
        showgrid=True,
        gridcolor='rgba(200,200,200,0.2)',
        zeroline=False,
        linecolor='rgba(0,0,0,0.3)'
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor='rgba(200,200,200,0.2)',
        zeroline=False,
        linecolor='rgba(0,0,0,0.3)'
    ),
    hoverlabel=dict(
        bgcolor='white',
        font_size=14,
        font_family='Helvetica'
    )
)
fig.show()
Top 10 Players by Points Above Expected:
                         total_shots  actual_points  expected_points  points_above_expected
PLAYER_NAME                                                                                
Luka Doncic                     1652           1892          1664.76                 227.24
Nikola Jokic                    1411           1727          1510.66                 216.34
Kevin Durant                    1436           1670          1509.19                 160.81
Stephen Curry                   1445           1657          1504.03                 152.97
Kawhi Leonard                   1162           1360          1208.93                 151.07
Jalen Brunson                   1648           1791          1641.07                 149.93
Shai Gilgeous-Alexander         1487           1687          1552.75                 134.25
CJ McCollum                     1055           1207          1073.40                 133.61
Kyrie Irving                    1131           1297          1164.22                 132.78
Paul George                     1236           1407          1279.42                 127.58

We use plotly to create an interactive scatterplot: hovering over a dot shows the player, their expected points, and their points above expected. The plot seems to make sense, since the players with the highest points above expected are among those generally considered the current top players (Luka Doncic, Nikola Jokic, Kevin Durant, etc.).

One caveat is that points above expected is a cumulative total rather than a per-shot rate, so it scales with shot volume. The plot is therefore most informative for players who have taken many shots; for low-volume players the values tend to stay close to zero, and no meaningful trend can be read from them.
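
Since points above expected is a season total, one possible way to compare players on a more equal footing is to divide by shots taken and drop players below a minimum volume. The sketch below uses made-up numbers and a hypothetical threshold of 500 shots; the column names mirror player_summary, but the helper per_shot_summary is not part of the notebook.

```python
import pandas as pd

summary = pd.DataFrame({
    "PLAYER_NAME": ["A", "B", "C"],
    "total_shots": [1500, 40, 900],
    "actual_points": [1700.0, 60.0, 950.0],
    "expected_points": [1550.0, 40.0, 960.0],
})

MIN_SHOTS = 500  # hypothetical cutoff; tune to the data


def per_shot_summary(summary, min_shots=MIN_SHOTS):
    """Filter out low-volume players and rank by points above expected per shot."""
    out = summary[summary["total_shots"] >= min_shots].copy()
    out["points_above_expected_per_shot"] = (
        out["actual_points"] - out["expected_points"]
    ) / out["total_shots"]
    return out.sort_values("points_above_expected_per_shot", ascending=False)


ranked = per_shot_summary(summary)
print(ranked[["PLAYER_NAME", "total_shots", "points_above_expected_per_shot"]])
```

Player B has the best raw ratio in this toy table but is excluded, because 40 shots are too few to support a conclusion.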

Shot Difficulty Proportion

Another metric this model enables is the proportion of each player's shots that it deems "difficult" (P(SHOT_MADE) < 0.5). This shows what kinds of shots specific players like to take, and we can also look at other features, such as the shot clock or shot type, to see which factors tend to produce difficult or easy shots.
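
A minimal sketch of how this proportion could be computed is shown below, assuming a dataframe with the PREDICTED_PROB column created earlier. The data here is invented, and the helper difficulty_summary and its output column names are hypothetical.

```python
import pandas as pd

shots = pd.DataFrame({
    "PLAYER_NAME": ["A", "A", "A", "A", "B", "B"],
    "PREDICTED_PROB": [0.30, 0.45, 0.60, 0.70, 0.55, 0.40],
    "SHOT_MADE": [True, False, True, True, False, True],
})


def difficulty_summary(shots, threshold=0.5):
    """Per player: share of shots below the threshold and make rate on those shots."""
    flagged = shots.assign(DIFFICULT=shots["PREDICTED_PROB"] < threshold)
    by_player = flagged.groupby("PLAYER_NAME").agg(
        total_shots=("SHOT_MADE", "count"),
        difficult_share=("DIFFICULT", "mean"),
    )
    # Make rate restricted to the shots flagged as difficult
    by_player["difficult_make_rate"] = (
        flagged[flagged["DIFFICULT"]].groupby("PLAYER_NAME")["SHOT_MADE"].mean()
    )
    return by_player


result = difficulty_summary(shots)
print(result)
```

Grouping by ACTION_TYPE or SHOT_TYPE instead of PLAYER_NAME would give the same breakdown per shot category.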